library(tidyverse)
library(data.table)
library(dplyr)
library(readxl)
library(caret)
library(magrittr)
ptm<-proc.time()
f<-"/home/bizon/Documents/nuMoM2b_Dataset_NICHD_Data_Challenge.csv"
mu_data_df <- fread(f)
Warning in fread(f) :
Detected 11717 column names but the data has 11633 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning.
proc.time()-ptm
user system elapsed
7.186 0.442 4.052
I want to be able to quickly look up and select the correct predictor and response variables directly inside this RStudio environment. Eventually, this could be extended into an interactive variable selector that feeds the "predictors" and "response" variables used in the various machine learning (ML) models below.
For now, we simply select the clinically/medically meaningful predictor and response variables manually, check them for artifacts, clean up those artifacts as appropriate, and proceed to ML modeling to test our hypotheses.
ptm<-proc.time()
f1_info<-"/home/bizon/Documents/mu2b/nuMoM2b_Dataset_Information.xlsx"
f2_code<-"/home/bizon/Documents/mu2b/nuMoM2b_Codebook_NICHD_Data_Challenge.xlsx"
mu_info_df <- read_excel(f1_info)
New names:
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* `` -> ...6
* ...
mu_code_df <- read_excel(f2_code)
proc.time()-ptm
user system elapsed
0.100 0.020 0.119
Now that we have loaded our Information and Codebook spreadsheets, we can use the DT library to view and search them easily.
Note: it is best to open the output tables in a new window.
library(DT)
datatable(mu_code_df, filter = 'top', options = list(pageLength = 10, autoWidth = TRUE))
Warning in instance$preRenderHook(instance) :
It seems your data is too big for client-side DataTables. You may consider server-side processing: https://rstudio.github.io/DT/server.html
Warning in instance$preRenderHook(instance) :
It seems your data is too big for client-side DataTables. You may consider server-side processing: https://rstudio.github.io/DT/server.html
datatable(mu_info_df, filter = 'top', options = list(pageLength = 10, autoWidth = TRUE))
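As a scripted alternative to the interactive DT view, the codebook can also be searched with grepl. A minimal sketch — the column names `Variable` and `Description` are assumptions, to be adjusted to the actual headers of mu_code_df:

```r
# Sketch: keyword search over a codebook data frame.
# The column names "Variable" and "Description" are assumptions --
# adjust them to the real headers of mu_code_df.
search_codebook <- function(codebook, keyword) {
  hits <- grepl(keyword, codebook$Description, ignore.case = TRUE) |
    grepl(keyword, codebook$Variable, ignore.case = TRUE)
  codebook[hits, , drop = FALSE]
}

# Toy example with hypothetical rows:
toy_codebook <- data.frame(
  Variable    = c("CMAJ01", "BMI"),
  Description = c("Hospital readmission", "Body mass index"),
  stringsAsFactors = FALSE
)
search_codebook(toy_codebook, "readmission")  # returns the CMAJ01 row
```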
I need to remove NA values from CMAJ01.
Note: this step can be repeated as needed in the preprocessing pipeline for the desired response variable
# remove all rows where the column CMAJ01 has NA
clean_CMAJ01_df<-mu_data_df[!is.na(mu_data_df$CMAJ01), ]
# The above won't work on an h2o object but does work on a regular data frame, so we read the data as a data frame first and, once preprocessing is done, move on to h2o
#View(clean_CMAJ01_df$CMAJ01) #super fast Excel-like view of the data
glimpse(clean_CMAJ01_df$CMAJ01)
int [1:8774] 2 2 2 2 2 2 2 2 2 2 ...
count(clean_CMAJ01_df, CMAJ01)
nrow(clean_CMAJ01_df)
[1] 8774
sum(is.na(clean_CMAJ01_df$CMAJ01))
[1] 0
Confirming that we removed all NAs for CMAJ01:
glimpse: int [1:8774] 2 2 2 2 2 2 2 2 2 2 ...
nrow: [1] 8774
sum is.na: [1] 0
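Since this step may be repeated for other response variables (per the note above), a reusable helper is a natural refactor. A minimal sketch:

```r
# Sketch: reusable NA-row filter, so the same preprocessing step can be
# repeated for any candidate response variable.
drop_na_rows <- function(df, col) {
  df[!is.na(df[[col]]), , drop = FALSE]
}

# e.g. clean_CMAJ01_df <- drop_na_rows(mu_data_df, "CMAJ01")
toy <- data.frame(CMAJ01 = c(1, NA, 2), x = 1:3)
nrow(drop_na_rows(toy, "CMAJ01"))  # 2
```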
Raw data cleaned with regard to NA values in CMAJ01, my response variable. Using this for modeling.

| CMAJ01 | n |
|---|---|
| 1 | 154 |
| 2 | 8620 |
library(h2o)
h2o.init()
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/tmp/Rtmp7CbtxK/filed496a07bb37/h2o_bizon_started_from_r.out
/tmp/Rtmp7CbtxK/filed492544c91d/h2o_bizon_started_from_r.err
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.18.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.18.04, mixed mode, sharing)
Starting H2O JVM and connecting: .. Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 1 seconds 98 milliseconds
H2O cluster timezone: America/Vancouver
H2O data parsing timezone: UTC
H2O cluster version: 3.34.0.3
H2O cluster version age: 5 days
H2O cluster name: H2O_started_from_R_bizon_qvo599
H2O cluster total nodes: 1
H2O cluster total memory: 15.68 GB
H2O cluster total cores: 16
H2O cluster allowed cores: 16
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
R Version: R version 4.1.1 (2021-08-10)
Reading in the preprocessed dataset.
ptm<-proc.time()
mu_data <- as.h2o(clean_CMAJ01_df)
|
| | 0%
|
|=========================================================================================================| 100%
proc.time()-ptm
user system elapsed
8.194 0.597 16.368
Maternal outcome: mortality (CMAE14). We do not have enough data to build a binary outcome prediction model for it, so we skip it.
First, exploring the impact of demographics on readmission rate (CMAJ01) as a measure of morbidity:
mu_data[,"CMAJ01"]<-as.factor(mu_data[,"CMAJ01"])
# predictors: the participant "demographics" dataset (column A) includes the relevant features
# this covers the variables from the "demographics" dataset; model as is and test later by adding participant features from the "CMA" set
predictors<-c("CRace",
"Race",
"eRace",
"eHispanic",
"BMI",
"BMI_Cat",
"Education",
"GravCat",
"SmokeCat1",
"SmokeCat2",
"SmokeCat3",
"Ins_Govt",
"Ins_Mil",
"Ins_Comm",
"Ins_Pers",
"Ins_Othr",
"PctFedPoverty",
"poverty"
) #makes a list of predicting variables
response<-"CMAJ01"
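Before handing this predictor list to any model, it is worth confirming that every name actually exists in the frame, to avoid opaque errors from h2o later. A minimal base-R sketch (the misspelled name below is a deliberate toy example):

```r
# Sketch: report predictor names that are absent from the data.
missing_predictors <- function(col_names, predictors) {
  setdiff(predictors, col_names)
}

# e.g. missing_predictors(names(clean_CMAJ01_df), predictors) should be empty
missing_predictors(c("CRace", "BMI"), c("CRace", "BMI", "Educaton"))  # catches the typo
```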
mu_data
[8776 rows x 11717 columns]
h2o.nrow(mu_data["CMAJ01"])
[1] 8776
h2o.group_by(data=mu_data, by="CMAJ01", nrow("CMAJ01"))
[3 rows x 2 columns]
In the raw dataset, all 8786 cases of CMAE14 are coded as 2 = did not die, so there are no cases with which to study this outcome.
For CMAJ01 we have a slightly better situation: 154 cases coded 1 (yes, readmitted) and 8620 coded 2 (no).
We needed to handle the NaN values and the extra row; this is best done on the R data frame, which is what we did above in the preprocessing step.
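The h2o.group_by call above returned three rows, i.e. a level besides 1 and 2 survived into the H2O frame. A quick base-R tabulation that keeps NA visible helps spot such stray codes; a sketch on toy data (run the same call on the real column):

```r
# Sketch: tabulate a response so that NA and unexpected codes stay visible.
tabulate_response <- function(x) table(x, useNA = "ifany")

toy_response <- c(1, 2, 2, 2, NA, 99)  # 99 stands in for a stray code
tabulate_response(toy_response)
```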
head(mu_data, cache=TRUE)
# agg <- h2o.aggregator(training_frame = mu_data,
# target_num_exemplars = 9000,
# rel_tol_num_exemplars = 0.5,
# categorical_encoding = "Eigen")
# new_df<-h2o.aggregated_frame(agg)
# new_df
The aggregator destroys this data frame, so we skip this step.
(First, let's do a "quick & dirty" pass with no hold-out dataset, to get some sense of the data from an ML standpoint.)
test_mother_mu_data_RFmodel<-h2o.randomForest(x=predictors,y=response,training_frame = mu_data, nfolds=10,seed = 1234)
|
| | 0%
|
|================= | 16%
|
|============================== | 28%
|
|========================================== | 40%
|
|======================================================== | 53%
|
|============================================================================================== | 89%
|
|======================================================================================================== | 99%
|
|=========================================================================================================| 100%
AUCpr: 0.9851791 for prediction of hospital readmission from demographics
test_mother_mu_data_RFmodel
Model Details:
==============
H2OBinomialModel: drf
Model ID: DRF_model_R_1634088199961_1359
Model Summary:
H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.01951532
RMSE: 0.1396972
LogLoss: 0.2031171
Mean Per-Class Error: 0.5
AUC: 0.5781258
AUCPR: 0.9858659
Gini: 0.1562517
R^2: -0.1317303
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: drf
** Reported on cross-validation data. **
** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 0.0190441
RMSE: 0.1380004
LogLoss: 0.1444208
Mean Per-Class Error: 0.5
AUC: 0.570256
AUCPR: 0.9852251
Gini: 0.1405121
R^2: -0.1044036
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Cross-Validation Metrics Summary:
That was a quick test.
Now I need to create a hold-out dataset first and then repeat the above as a train:validation run.
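The h2o.splitFrame call that follows performs this split on the H2OFrame; for reference, the equivalent on a plain data frame can be sketched in base R:

```r
# Sketch: a reproducible 80:20 split of a plain data frame;
# h2o.splitFrame does the analogous operation on an H2OFrame.
split_frame_df <- function(df, ratio = 0.8, seed = 1) {
  set.seed(seed)
  idx <- sample(nrow(df), size = floor(ratio * nrow(df)))
  list(train = df[idx, , drop = FALSE], valid = df[-idx, , drop = FALSE])
}

toy   <- data.frame(x = 1:100)
parts <- split_frame_df(toy)
c(train = nrow(parts$train), valid = nrow(parts$valid))  # 80 and 20
```

Note that h2o.splitFrame splits probabilistically, so its fold sizes are only approximately 80:20, unlike this exact sketch.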
mother_mu_data_split <- h2o.splitFrame(mu_data, ratios=0.8, seed = 1)
train=mother_mu_data_split[[1]]
valid=mother_mu_data_split[[2]]
ptm<-proc.time()
mother_mu_data_RFmodel<-h2o.randomForest(x=predictors,y=response,training_frame = train, nfolds=10,seed = 1234)
|
| | 0%
|
|==================== | 19%
|
|================================ | 31%
|
|============================================ | 42%
|
|======================================================== | 53%
|
|============================================================================================= | 88%
|
|==================================================================================================== | 96%
|
|=========================================================================================================| 100%
proc.time()-ptm
user system elapsed
0.883 0.032 8.377
Yields very good AUCpr=0.984
ptm<-proc.time()
mother_mu_data_predict<-h2o.predict(object=mother_mu_data_RFmodel, newdata=valid)
|
| | 0%
|
|=========================================================================================================| 100%
mother_mu_data_RFmodel
Model Details:
==============
H2OBinomialModel: drf
Model ID: DRF_model_R_1634088199961_2019
Model Summary:
H2OBinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.01868367
RMSE: 0.1366882
LogLoss: 0.283439
Mean Per-Class Error: 0.5
AUC: 0.5435738
AUCPR: 0.9840289
Gini: 0.0871475
R^2: -0.1232257
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
H2OBinomialMetrics: drf
** Reported on cross-validation data. **
** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 0.01836745
RMSE: 0.1355266
LogLoss: 0.1634527
Mean Per-Class Error: 0.5
AUC: 0.5551044
AUCPR: 0.9843388
Gini: 0.1102087
R^2: -0.1042151
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Cross-Validation Metrics Summary:
proc.time()-ptm
user system elapsed
0.056 0.004 0.089
mod=mother_mu_data_RFmodel
perf <- h2o.performance(mod,valid)
metrics <- as.data.frame(h2o.metric(perf))
metrics %>%
ggplot(aes(recall,precision)) +
geom_line() +
theme_minimal()
ptm<-proc.time()
# toggle progress bar if desired:
# h2o.show_progress()
exp <-h2o.explain(object=mother_mu_data_RFmodel, newdata=valid)
print(exp)
Confusion Matrix
================
> Confusion matrix shows a predicted class vs an actual class.
DRF_model_R_1634088199961_2019
------------------------------
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.559264203719795:
Variable Importance
===================
> The variable importance plot shows the relative importance of the most important variables in the model.
SHAP Summary
============
> SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
Partial Dependence Plots
========================
> Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.
proc.time()-ptm
user system elapsed
57.874 3.947 82.542
results_df <- function(h2o_model) {
h2o_model@model$cross_validation_metrics_summary %>%
as.data.frame() %>%
select(-mean, -sd) %>%
t() %>%
as.data.frame() %>%
mutate_all(as.character) %>%
mutate_all(as.numeric) -> k
k %>%
select(Accuracy = accuracy,
prAUC = pr_auc,
Precision = precision,
Specificity = specificity,
Recall = recall,
Logloss = logloss) %>%
return()
}
# Using function
results_df(mod) -> outcome
# Outcome
outcome %>%
gather(Metrics, Values) %>%
ggplot(aes(Metrics, Values, fill = Metrics, color = Metrics)) +
geom_boxplot(alpha = 0.3, show.legend = FALSE) +
facet_wrap(~ Metrics, scales = "free") +
labs(title = "Performance of our ML model using H2o package ",
caption = "Data Source: NICHD Decoding Maternal Morbidity Data Challenge\nCreated by Martin Frasch (further credit to https://bit.ly/3BpPqcb)") +
theme_minimal()
# Statistics summary
outcome %>%
gather(Metrics, Values) %>%
group_by(Metrics) %>%
summarise_each(funs(mean, median, min, max, sd, n())) %>%
mutate_if(is.numeric, function(x) {round(100*x, 2)}) %>%
knitr::kable(col.names = c("Criterion", "Mean", "Median", "Min", "Max", "SD", "N"))
Warning: `summarise_each_()` was deprecated in dplyr 0.7.0.
Please use `across()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
Warning: `funs()` was deprecated in dplyr 0.8.0.
Please use a list of either functions or lambdas:
# Simple named list:
list(mean = mean, median = median)
# Auto named with `tibble::lst()`:
tibble::lst(mean, median)
# Using lambdas
list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
| Criterion | Mean | Median | Min | Max | SD | N |
|---|---|---|---|---|---|---|
| Accuracy | 98.30 | 98.24 | 97.83 | 99.03 | 0.34 | 1000 |
| Logloss | 16.63 | 14.30 | 9.15 | 30.47 | 7.38 | 1000 |
| prAUC | 98.46 | 98.51 | 97.55 | 99.14 | 0.52 | 1000 |
| Precision | 98.30 | 98.24 | 97.83 | 99.03 | 0.34 | 1000 |
| Recall | 100.00 | 100.00 | 100.00 | 100.00 | 0.00 | 1000 |
| Specificity | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1000 |
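The deprecation warnings above come from summarise_each() and funs(); the same per-metric summary can be written with the current across() idiom. A sketch, assuming dplyr >= 1.0 and tidyr are available:

```r
# Sketch: the deprecated summarise_each(funs(...)) step rewritten with
# across() (dplyr >= 1.0); same Mean/Median/Min/Max/SD/N summary.
library(dplyr)
library(tidyr)

summarise_metrics <- function(outcome) {
  outcome %>%
    pivot_longer(everything(), names_to = "Metrics", values_to = "Values") %>%
    group_by(Metrics) %>%
    summarise(across(Values,
                     list(mean = mean, median = median, min = min,
                          max = max, sd = sd),
                     .names = "{.fn}"),
              N = n(), .groups = "drop")
}

toy <- data.frame(Accuracy = c(0.98, 0.99), Recall = c(1, 1))
summarise_metrics(toy)
```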
Using the `balance_classes` argument available in this model allows us to better control for the unbalanced data we are dealing with here, because the re-hospitalization rate is only ~2%.
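For reference, `balance_classes = TRUE` automates inside H2O what a manual minority-class upsampling would do; a base-R sketch of that idea:

```r
# Sketch: manual minority-class upsampling -- the kind of rebalancing
# that balance_classes = TRUE automates inside H2O.
upsample_minority <- function(df, col, seed = 1234) {
  set.seed(seed)
  counts   <- table(df[[col]])
  minority <- names(counts)[which.min(counts)]
  extra    <- max(counts) - min(counts)
  idx      <- which(df[[col]] == minority)
  rbind(df, df[sample(idx, extra, replace = TRUE), , drop = FALSE])
}

toy <- data.frame(y = c(rep(2, 8), 1, 1), x = 1:10)
table(upsample_minority(toy, "y")$y)  # class counts now equal
```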
maternal_dt_model<-h2o.gbm(x=predictors,y=response,training_frame = train, validation_frame = valid, balance_classes = TRUE, seed = 1234, nfolds=10)
# GBM hyperparamters
gbm_params = list(max_depth = seq(2, 10))
# Train and validate a cartesian grid of GBMs
gbm_grid = h2o.grid("gbm", x = predictors, y = response,
grid_id = "gbm_grid_1tree8",
training_frame = train,
validation_frame = valid,
balance_classes = TRUE,
ntrees = 1, min_rows = 1, sample_rate = 1, col_sample_rate = 1,
learn_rate = .01, seed = 1234,
hyper_params = gbm_params)
gbm_gridperf = h2o.getGrid(grid_id = "gbm_grid_1tree8",
sort_by = "auc",
decreasing = TRUE)
# what is the performance of this GBM?
maternal_dt_model
We obtain prAUC=0.98
gbm_gridperf
Inflection point is at max_depth=8
maternal_1_tree = h2o.gbm(x = predictors, y = response,
training_frame = train, balance_classes = TRUE,
ntrees = 1, min_rows = 1, sample_rate = 1, col_sample_rate = 1,
max_depth = 8,
# use early stopping once the validation AUC doesn't improve by at least 0.01%
# for 5 consecutive scoring events
stopping_rounds = 3, stopping_tolerance = 0.01,
stopping_metric = "AUC",
seed = 1)
maternal_1_tree
AUCPR: 0.882762
maternal_Tree = h2o.getModelTree(model = maternal_1_tree, tree_number = 1)
# Visualizing H2O Trees
library(data.tree)
createDataTree <- function(h2oTree) {
h2oTreeRoot = h2oTree@root_node
dataTree = Node$new(h2oTreeRoot@split_feature)
dataTree$type = 'split'
addChildren(dataTree, h2oTreeRoot)
return(dataTree)
}
addChildren <- function(dtree, node) {
if(class(node)[1] != 'H2OSplitNode') return(TRUE)
feature = node@split_feature
id = node@id
na_direction = node@na_direction
if(is.na(node@threshold)) {
leftEdgeLabel = printValues(node@left_levels, na_direction=='LEFT', 4)
rightEdgeLabel = printValues(node@right_levels, na_direction=='RIGHT', 4)
}else {
leftEdgeLabel = paste("<", node@threshold, ifelse(na_direction=='LEFT',',NA',''))
rightEdgeLabel = paste(">=", node@threshold, ifelse(na_direction=='RIGHT',',NA',''))
}
left_node = node@left_child
right_node = node@right_child
if(class(left_node)[[1]] == 'H2OLeafNode')
leftLabel = paste("prediction:", left_node@prediction)
else
leftLabel = left_node@split_feature
if(class(right_node)[[1]] == 'H2OLeafNode')
rightLabel = paste("prediction:", right_node@prediction)
else
rightLabel = right_node@split_feature
if(leftLabel == rightLabel) {
leftLabel = paste(leftLabel, "(L)")
rightLabel = paste(rightLabel, "(R)")
}
dtreeLeft = dtree$AddChild(leftLabel)
dtreeLeft$edgeLabel = leftEdgeLabel
dtreeLeft$type = ifelse(class(left_node)[1] == 'H2OSplitNode', 'split', 'leaf')
dtreeRight = dtree$AddChild(rightLabel)
dtreeRight$edgeLabel = rightEdgeLabel
dtreeRight$type = ifelse(class(right_node)[1] == 'H2OSplitNode', 'split', 'leaf')
addChildren(dtreeLeft, left_node)
addChildren(dtreeRight, right_node)
return(FALSE)
}
printValues <- function(values, is_na_direction, n=4) {
l = length(values)
if(l == 0)
value_string = ifelse(is_na_direction, "NA", "")
else
value_string = paste0(paste0(values[1:min(n,l)], collapse = ', '),
ifelse(l > n, ",...", ""),
ifelse(is_na_direction, ", NA", ""))
return(value_string)
}
This decision tree, also supplied as a PDF, is meant to help build intuition about how the model works.
library(DiagrammeR)
# customized DT for our H2O model
maternal_mu2DataTree = createDataTree(maternal_Tree)
GetEdgeLabel <- function(node) {return (node$edgeLabel)}
GetNodeShape <- function(node) {switch(node$type,
split = "diamond", leaf = "oval")}
GetFontName <- function(node) {switch(node$type,
split = 'Palatino-bold',
leaf = 'Palatino')}
SetEdgeStyle(maternal_mu2DataTree, fontname = 'Palatino-italic',
label = GetEdgeLabel, labelfloat = TRUE,
fontsize = "26", fontcolor='royalblue4')
SetNodeStyle(maternal_mu2DataTree, fontname = GetFontName, shape = GetNodeShape,
fontsize = "26", fontcolor='royalblue4',
height="0.75", width="1")
SetGraphStyle(maternal_mu2DataTree, rankdir = "LR", dpi=70.)
plot(maternal_mu2DataTree, output = "graph")
ptm<-proc.time()
exp_dt<-h2o.explain(maternal_dt_model,valid)
proc.time()-ptm
exp_dt
# Build and train the model:
mo2b_nb <- h2o.naiveBayes(x = predictors,
y = response,
training_frame = train,
laplace = 0,
nfolds = 10,
seed = 1234)
|
| | 0%
|
|======================================================================================================= | 91%
|
|=================================================================================================================| 100%
# Eval performance:
perf <- h2o.performance(mo2b_nb)
# Generate the predictions on a test set (if necessary):
pred <- h2o.predict(mo2b_nb, newdata = valid)
|
| | 0%
|
|=================================================================================================================| 100%
perf
H2OBinomialMetrics: naivebayes
** Reported on training data. **
MSE: 0.2641484
RMSE: 0.5139537
LogLoss: 1.244752
Mean Per-Class Error: 0.5
AUC: 0.6178245
AUCPR: 0.9877748
Gini: 0.235649
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
AUC: 0.6178245
AUCPR: 0.9877748
# best viewed in a new window or see, please, the PDF included with the submission
exp_nb <- h2o.explain(mo2b_nb,valid)
Warning: StackedEnsemble does not have a variable importance. Picking all columns. Set `columns` to a vector of columns to explain just a subset of columns.
Note the highly variable partial importance of the different socio-demographic characteristics
exp_nb
Confusion Matrix
================
> Confusion matrix shows a predicted class vs an actual class.
NaiveBayes_model_R_1634088199961_6290
-------------------------------------
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 4.35072885475268e-05:
Partial Dependence Plots
========================
> Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.
*** recursive gc invocation
Compare here. The findings are to be interpreted with caution at this stage. Once we obtain a larger external dataset for validation, with a more balanced case distribution, this will become more useful and allow building an inference engine that could be deployed for use. I am therefore presenting this code as a reference for future work.
Nevertheless, it is evident that optimization even at this stage yields a marked (~60%) improvement in classification performance, up to an AUC of 0.76. This result can vary between runs.
Please note: this code runs for ~2 hours on a well-equipped deep learning workstation.
ptm<-proc.time()
maternal_aml <- h2o.automl(x=predictors,y=response,training_frame = train, max_models = 20, seed = 1)
|
| | 0%
|
|== | 2%
18:41:03.771: XGBoost_1_AutoML_2_20211012_184101 [XGBoost def_2] failed: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for XGBoost model: XGBoost_1_AutoML_2_20211012_184101_cv_1. Details: ERRR on field: _response_column: Response contains missing values (NAs) - not supported by XGBoost.
|
|=== | 2%
|
|=== | 3%
|
|==== | 3%
|
|===== | 5%
|
|====== | 5%
|
|====== | 6%
18:46:52.731: XGBoost_2_AutoML_2_20211012_184101 [XGBoost def_1] failed: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for XGBoost model: XGBoost_2_AutoML_2_20211012_184101_cv_1. Details: ERRR on field: _response_column: Response contains missing values (NAs) - not supported by XGBoost.
|
|========= | 8%
|
|========== | 8%
|
|========== | 9%
|
|============ | 10%
|
|============ | 11%
|
|============== | 12%
|
|============== | 13%
|
|=============== | 13%
|
|================= | 15%
|
|================== | 16%
|
|==================== | 18%
18:48:11.473: XGBoost_3_AutoML_2_20211012_184101 [XGBoost def_3] failed: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for XGBoost model: XGBoost_3_AutoML_2_20211012_184101_cv_1. Details: ERRR on field: _response_column: Response contains missing values (NAs) - not supported by XGBoost.
|
|===================== | 19%
|
|====================== | 19%
|
|====================== | 20%
|
|======================= | 20%
|
|======================== | 21%
|
|========================= | 22%
|
|========================== | 23%
|
|============================ | 25%
|
|============================================= | 40%
|
|======================================================== | 50%
|
|============================================================== | 54%
|
|================================================================ | 57%
|
|====================================================================== | 62%
|
|=========================================================================== | 67%
|
|================================================================================ | 71%
19:11:11.167: StackedEnsemble_BestOfFamily_6_AutoML_2_20211012_184101 [StackedEnsemble best_of_family_xgboost (built with xgboost metalearner, using top model from each algorithm type)] failed: java.lang.RuntimeException: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for XGBoost model: metalearner_xgboost_StackedEnsemble_BestOfFamily_6_AutoML_2_20211012_184101_cv_1. Details: ERRR on field: _response_column: Response contains missing values (NAs) - not supported by XGBoost.
|
|==================================================================================== | 74%
19:11:16.761: StackedEnsemble_AllModels_5_AutoML_2_20211012_184101 [StackedEnsemble all_xgboost (built with xgboost metalearner, using all AutoML models)] failed: java.lang.RuntimeException: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for XGBoost model: metalearner_xgboost_StackedEnsemble_AllModels_5_AutoML_2_20211012_184101_cv_1. Details: ERRR on field: _response_column: Response contains missing values (NAs) - not supported by XGBoost.
|
|======================================================================================= | 77%
|
|========================================================================================= | 79%
|
|=================================================================================================================| 100%
maternal_lb <- maternal_aml@leaderboard
#print(maternal_lb, n = nrow(maternal_lb)) #Print all rows instead of default 6 rows
proc.time()-ptm
user system elapsed
465.386 13.888 1852.840
ptm<-proc.time()
maternal_perf_valid <- h2o.performance(maternal_aml@leader,newdata=valid,xval=FALSE,valid=TRUE)
pred <- h2o.predict(maternal_aml@leader, valid)
h2o.auc(maternal_aml@leader)
# Using function
results_df(maternal_aml@leader) -> outcome
# Outcome
outcome %>%
gather(Metrics, Values) %>%
ggplot(aes(Metrics, Values, fill = Metrics, color = Metrics)) +
geom_boxplot(alpha = 0.3, show.legend = FALSE) +
facet_wrap(~ Metrics, scales = "free") +
labs(title = "Performance of the best AutoML model using H2o package ",
caption = "Data Source: NICHD Decoding Maternal Morbidity Data Challenge\nCreated by Martin Frasch (further credit to https://bit.ly/3BpPqcb)") +
theme_minimal()
We observe no specificity because the dataset is so unbalanced that, by the luck of the draw in the 80:20 split, we get no true positives in the validation fold.
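One mitigation is a stratified split, which guarantees that the rare positives land in both folds. A base-R sketch (caret, loaded above, offers createDataPartition for the same purpose):

```r
# Sketch: stratified 80:20 split so that rare positives appear in both
# folds (caret::createDataPartition provides the same stratification).
stratified_split <- function(df, col, ratio = 0.8, seed = 1) {
  set.seed(seed)
  idx <- unlist(lapply(split(seq_len(nrow(df)), df[[col]]),
                       function(i) sample(i, floor(ratio * length(i)))))
  list(train = df[idx, , drop = FALSE], valid = df[-idx, , drop = FALSE])
}

toy   <- data.frame(y = c(rep(2, 95), rep(1, 5)))
parts <- stratified_split(toy, "y")
table(parts$valid$y)  # both classes represented
```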
ptm<-proc.time()
exp <-h2o.explain(maternal_aml@leader, valid)
proc.time()-ptm
print(exp)
Test run from the demo site. Details on setup here.
Requires Python h2o4gpu module. Details of installation here.
library(h2o4gpu)
library(reticulate) # only needed if using a virtual Python environment or multiple Python versions
use_python("/home/bizon/anaconda3/bin/python3.6", required = TRUE)
#use_virtualenv("/home/bizon/h2o_ml") # set this to the path of your venv if any is used
# Setup dataset
x <- iris[1:4]
y <- as.integer(iris$Species) - 1
# Initialize and train the classifier
model <- h2o4gpu.random_forest_classifier() %>% fit(x, y)
# Other ML approaches are also available such as GBM, regression models, unsupervised learning
# Make predictions
predictions <- model %>% predict(x)
I am leaving this for future implementations. It simply points out the possibility of leveraging GPU-equipped deep learning workstations to speed up model building on a dataset of this size.
=> See “postpartum depression” notebook for additional models.
sessionInfo()